Analysis & Predictions of the Impact of COVID-19

TEAM MEMBERS:

  • Anne Rother
  • Shivani Hegde
  • Mohammed Farhaan Shaikh
  • Maneendra Perera
  • Poornima Venkatesha

1 Overview and Motivation

The world is battling an invisible and potentially fatal enemy and is trying to understand how to live with the danger presented by a virus. The disease it causes has been named COVID-19, and the virus belongs to the coronavirus family, of which it is the seventh member known to affect humans. The novel coronavirus outbreak originated in Wuhan, China and has been declared a ‘pandemic’ by the World Health Organization (WHO).

As COVID-19 is currently the dominant topic on everyone’s mind, we would like to analyse and understand the global impact of the novel coronavirus. It has undoubtedly threatened people’s wellbeing and has resulted in social, economic and political crises. This motivates us to predict the number of future cases from the current circumstances, which can help a country make appropriate decisions to control the spread of the virus. Understanding the causes and the course of the disease in different people is another major concern that many are curious about. Lastly, we would like to know the opinions and experiences of people around the world during the pandemic.

3 Initial Questions

To get an overview, we have divided our questions into three domains: Twitter, a medical dataset of patients infected with the novel coronavirus, and the popular Johns Hopkins University dataset.
At the beginning of April our questions were still very general, such as “Can we identify some trends?”
In the course of the project, and with the daily news and restrictions caused by the virus, the questions in the above-mentioned areas became more specific:

3.0.1 What will be the worldwide effect of COVID-19?

Countries have different strategies to minimize the spread of the virus and prevent a second wave. To predict the future cases (confirmed, recovered, dead and active) per country purely from previous values of the time series, we have used the ARIMA model. ARIMA, short for ‘AutoRegressive Integrated Moving Average’, is a forecasting algorithm based on the idea that the information in the past values of a time series can alone be used to predict its future values. Forecasting gives us an idea of the extent to which a country will be affected if no changes are made to improve the current situation. For example, the government of a country can use it to decide whether there should be a lockdown, travel restrictions to or from other countries, etc.
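
For reference, a non-seasonal ARIMA(p, d, q) model can be written in its standard textbook form as shown below. The notation is the usual one and is not taken from the project code: B is the backshift operator, the phi and theta terms are the autoregressive and moving-average coefficients, epsilon is white noise and c is a constant.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t,
\qquad B\, y_t = y_{t-1}
```

Here p is the number of past values used, d is the number of times the series is differenced to remove its trend, and q is the number of past forecast errors used.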

3.0.2 Some people are killed whereas others are spared by COVID-19, why?

We analysed the patient dataset to gain better insight into the symptoms and the main risk factors, such as age, history of chronic diseases and travel, for developing more serious COVID-19 outcomes. This understanding is important for early detection of the virus, keeping in mind that infections are often asymptomatic. The situation varies between countries; for example, in Africa advisors say that early detection of the virus can help them develop a strategy that minimizes the hardship caused by lockdowns. This is also the case in many other countries.

We analysed the different age groups because the risk of dying from the infection and the likelihood of requiring intensive medical care increase significantly with age. Initially, we analysed patients across the globe; then we narrowed the analysis down to Wuhan, China, where the mysterious pneumonia cases were first detected in December 2019. Transmission of the disease is primarily due to close contact with one another, so we look at the patients from various countries who visited Wuhan. Patients with a history of chronic diseases tend to be more vulnerable to this virus because they have a weakened immune system, so we wanted to analyse the severity and fatality rates of such patients.

3.0.3 What is the role of social media such as Twitter during the pandemic?

While this pandemic has continued to influence the lives of millions, a number of nations have turned to total lockdown. During quarantine, people have taken to social media to express their feelings through support, awareness raising and entertainment. We have chosen Twitter, a social networking and microblogging platform, to analyze the popular COVID-19 trends and the engagement of people. Based on the tweets, we would like to see the different behaviors and reactions exhibited worldwide. People are different, and they behave differently in difficult situations. It is therefore interesting to know how people have expressed their emotions and how they have dealt with this unfortunate situation. Furthermore, we are curious to know which words the public has commonly used to express their feelings. We therefore perform sentiment analysis on the tweet texts retrieved from Twitter using the Twitter API.

4 Data

4.0.1 Novel Coronavirus 2019 Time Series Data on Cases

There are three datasets of time series data that track the number of people affected by COVID-19 worldwide. They contain the following three main pieces of information:

  • Number of confirmed cases of Coronavirus infection
  • Number of deaths from Coronavirus infection
  • Number of recovered cases from Coronavirus infection

No preprocessing was done here, as the dataset could be worked on directly. Our time series dataset is a sequence in which a metric is recorded on a daily basis (e.g., similar to a weather forecast).

Dataset can be downloaded here:
https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

4.0.2 Patient Medical Data for Novel Coronavirus COVID-19

This dataset contains medical records of patients infected with the novel coronavirus COVID-19. The data was imported and made computable on May 27, 2020.

Dataset can be downloaded here:
https://datarepository.wolframcloud.com/resources/Patient-Medical-Data-for-Novel-Coronavirus-COVID-19

This dataset has a variety of variables with different representations. We extracted the values from them to do the analysis.

4.0.3 Twitter Data

We work on the most recent dataset aggregated from Twitter within a particular timeframe using the twitteR and rtweet libraries. twitteR provides an interface to the Twitter web API, and rtweet acts as a client for Twitter’s REST and stream APIs; both are used to retrieve the data.

Data cleaning was done by removing white spaces, links, punctuation, stop words, retweets, emoticons, mentions, control characters and digits, and by converting the text to lower case.

5 Exploratory Data Analysis

We performed exploratory data analysis on all the components of the datasets. Below are the sections of EDA:

  • Analysis on Patient Medical Data
  • Predictions using Time Series Data
  • Analysis on Twitter Data

5.1 Analysis on Patient Medical Data

5.1.1 All Patients by Age

Plotting the age distribution of all patients as a histogram

We compared the age distribution among the patients (Child, Adult and Senior Adult). The age group between 40 and 60 has been affected the most. Our analysis shows, and international research confirms, that the percentage of children among confirmed COVID-19 patients is low, ranging from 1% in young children to 6% in older children. Children with COVID-19 have the same symptoms as adults; the most common symptoms in children are coughing, fever and sore throat. Worldwide, very few children with COVID-19 have died. In clusters of patients, adults are almost always the source patient. On this basis, decisions such as the re-opening of schools and childcare facilities were made (RIVM, n.d.).
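
The age grouping behind such a histogram can be sketched in R as follows. This is a minimal illustration only: the data frame `patients`, its `Age` column and the cut-offs (under 18, 18 to 59, 60 and over) are assumptions and not the values used in the report.

```r
# Hypothetical patient ages; the real analysis uses the patient medical dataset.
patients <- data.frame(Age = c(4, 15, 23, 37, 45, 52, 58, 61, 70, 84))

# Assumed cut-offs: Child < 18, Adult 18-59, Senior Adult >= 60.
patients$AgeGroup <- cut(
  patients$Age,
  breaks = c(0, 18, 60, Inf),
  labels = c("Child", "Adult", "Senior Adult"),
  right  = FALSE  # intervals are [lower, upper)
)

# Number of patients per age group.
age_counts <- table(patients$AgeGroup)
print(age_counts)

# Histogram of the raw ages in 10-year bins.
hist(patients$Age, breaks = seq(0, 90, by = 10),
     main = "Patients by age", xlab = "Age")
```

Using `right = FALSE` keeps an 18-year-old in the Adult group rather than the Child group; a different convention would only require changing the breaks.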

5.1.2 Chronic Diseases w.r.t Gender

Plot of the distribution of males and females of any age with chronic diseases

Everyone exposed to the virus is at risk. However, some people are more vulnerable than others to becoming severely ill and are more in need of medical facilities. According to the “Centers for Disease Control and Prevention”, people of any age with certain chronic diseases are at higher risk of severe illness from COVID-19.

5.1.3 Frequency of Chronic Diseases

Plot to see the most common chronic diseases

When a person with a chronic medical condition gets infected with the coronavirus, they are likely to face an increased risk of developing severe symptoms. We computed the recovery-to-death ratio for patients with and without chronic disease. The ratio is 17:49 for patients with chronic disease and 53:8 for patients without chronic disease. This shows that people without chronic medical conditions have a higher chance of survival than those with them.
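
To make the two ratios easier to compare, they can be converted into recovery shares. The short sketch below uses only the counts quoted above; the data frame layout and column names are chosen for illustration.

```r
# Recovery-to-death counts quoted in the text.
outcomes <- data.frame(
  group     = c("chronic disease", "no chronic disease"),
  recovered = c(17, 53),
  died      = c(49, 8)
)

# Share of patients with a known outcome who recovered.
outcomes$recovery_share <- outcomes$recovered /
  (outcomes$recovered + outcomes$died)

print(outcomes)
```

This gives a recovery share of roughly 26% for patients with chronic disease and roughly 87% for patients without.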

5.1.5 Word Cloud of Common Symptoms

Word cloud showing the overall most common symptoms

From our word cloud we can say that cough and fever are the most common symptoms among the patients. People generally develop signs and symptoms, including mild respiratory symptoms and fever, on average 5-6 days after infection (mean incubation period 5-6 days, range 1-14 days).

5.1.6 Distribution of Symptoms in Wuhan, China

Plotting a pie chart to display the common symptoms seen initially in Wuhan, China, where the outbreak originated.

Based on this, awareness of these symptoms was raised in the rest of the world during the initial stages. The chart shows the highest-ranked symptoms seen together in Wuhan, China, the origin of COVID-19. COVID-19 symptoms are non-specific, and the disease presentation can range from no symptoms (asymptomatic) to severe pneumonia and death.

5.1.7 Patient Locations

Plotting the locations of patients on a world map

This shows the locations of the patients in our dataset across the globe, plotted on the world map. The outbreak spread from the Chinese city of Wuhan to more than 180 countries and territories, affecting every continent except Antarctica. Efforts to stamp out the pneumonia-like illness have driven countries to enforce lockdowns and have led to widespread halts of international travel, mass layoffs and shattered financial markets.

5.1.8 Analysis using t-tests and Box-plots

Plotting a box plot to see the average age group likely to recover

Plotting a box plot to see the average age group unlikely to recover

Performing t-tests to see how confident we are that older people are more likely to die than younger people from COVID-19.

## 
##  Welch Two Sample t-test
## 
## data:  dead$Age and alive$Age
## t = 14.821, df = 329.72, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  18.74940 24.48827
## sample estimates:
## mean of x mean of y 
##  65.58389  43.96505

We first make a separate data frame consisting of only the age, gender and death columns. From this, we create a boxplot to analyse the age of people who died and of those who did not. The rhombus in the boxplot shows the mean: 44.0 for those who survived (mean of y) and 65.6 for those who died (mean of x). This suggests that older people have lower chances of survival.

But is this true universally for the population? How confident are we that this is true? We perform t-tests to gauge our confidence and to see if we can trust the means we got. In this case, we use a 95% confidence interval.

Looking at the confidence interval, we can say with 95% confidence that the difference in mean age between patients who have died and those who have not lies between 18.7 and 24.5 years. The p-value is almost 0. This means that there is a ~0% chance of obtaining such an extreme result randomly from this sample under the null hypothesis (which is that the mean ages of the two groups are equal). For this reason, we can reasonably reject the null hypothesis (at the conventional significance level of 0.05) and say that people who have died from COVID-19 are indeed older on average than those who did not. Separately, there are articles reporting that men have a higher coronavirus death rate (Katie Polglase and Foster, n.d.).
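
The test can be reproduced in outline with the sketch below. The ages are simulated, so the numbers will differ from the output above; the group sizes, means and standard deviations are assumptions. Only the call `t.test(dead$Age, alive$Age)` mirrors the `data:` line of the reported output.

```r
set.seed(1)

# Simulated ages (assumed values, not the patient dataset).
dead  <- data.frame(Age = rnorm(150, mean = 65, sd = 15))
alive <- data.frame(Age = rnorm(200, mean = 44, sd = 15))

# t.test() performs Welch's two-sample t-test by default (var.equal = FALSE).
result <- t.test(dead$Age, alive$Age)

result$estimate   # group means (mean of x, mean of y)
result$conf.int   # 95% confidence interval for the difference in means
result$p.value    # p-value under the null hypothesis of equal means
```

Welch's variant does not assume equal variances in the two groups, which is why the reported degrees of freedom (329.72) are not a whole number.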

5.1.9 Timeline of Patients

COVID-19 is a confusing illness, wrapped in uncertainty. There have not been sufficient scientific studies to tell precisely how long it takes for a person to recover. Some possible ranges have been identified, but these seem to vary from person to person. According to the WHO, recovery times tend to be about 2 weeks for those with mild disease and about 3-6 weeks for those with severe or critical disease.

5.1.9.1 Recovered Patients

Plot to show the time taken for recovery by each patient

5.1.9.2 Deceased Patients

Plot to show the timeline of each deceased patient

5.2 Predictions using Time series Data

We chose the ARIMA model to predict the cases for the next ‘n’ days because the current situation is so uncertain. For example, the number of cases in Italy rose suddenly and even overtook that of China, where the outbreak originated; similarly, cases are currently increasing rapidly in India. We should therefore be ready to expect the unexpected. ARIMA models capture the non-seasonal changes that a time series exhibits and gave better predictions than the other models we tried. The naive forecasting model, for instance, gave us similar results every day, since it only considers the most recent value and uses it as the forecast, so no adjustment is made for changes due to external factors. We faced the same issue with the Holt-Winters forecasting model.
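
The limitation of the naive model can be seen in a few lines. The sketch below uses made-up case counts and implements the textbook naive forecast (repeat the last observed value); it is not the code used in the project.

```r
# Hypothetical cumulative case counts for eight days.
cases <- c(10, 14, 21, 30, 44, 61, 85, 120)

# Naive forecast: every future day is predicted as the last observed value.
horizon <- 3
naive_forecast <- rep(tail(cases, 1), horizon)

print(naive_forecast)
```

Although the series is clearly rising, the naive forecast stays flat at the last value, which is why it cannot react to a growing or shrinking outbreak.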

To get an idea of how the virus could spread across a country in the next couple of days, we decided to create a predictor to forecast those values. The idea behind this was that, based on the current predictions, a country could take appropriate actions to cut down the spread of the virus, such as enforcing a lockdown if the number of cases is increasing sharply or lifting the lockdown if things are getting better.

The name of a country is taken as input, along with the type of cases to predict (confirmed, recovered, dead, active) and the number of days to predict. From this input, the predictor generates a set of predictions.

We pull the current Johns Hopkins dataset from their server and create the required datasets from it.

Global parameters used throughout the predictor code: we enter the country for which the predictions need to be made, set the number of days to predict and choose which category of cases to predict (confirmed, recovered, dead, active).

We filter out the data for the country selected as input, generate the active cases dataset from the other three datasets (active_cases = confirmed - recovered - dead) and then generate one data frame with all cases of that country.
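
The derivation of the active cases can be sketched as follows. The counts and the data frame `country_cases` are hypothetical; only the formula active = confirmed - recovered - dead comes from the text.

```r
# Hypothetical daily counts for one country.
country_cases <- data.frame(
  date      = as.Date("2020-03-01") + 0:2,
  confirmed = c(100, 150, 210),
  recovered = c(10, 30, 60),
  dead      = c(2, 5, 9)
)

# Active cases are the confirmed cases that have neither recovered nor died.
country_cases$active <- with(country_cases, confirmed - recovered - dead)

print(country_cases)
```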

5.2.1 Generating the Time series Data

Generating a sequence of dates from first to last which is used to create the time series data.

## Time Series:
## Start = c(2020, 22) 
## End = c(2020, 27) 
## Frequency = 365 
##      confirmed
## [1,]         0
## [2,]         0
## [3,]         0
## [4,]         0
## [5,]         0
## [6,]         1
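
A time series object with the same start, end and frequency as in the output above can be built as sketched below. The first date (2020-01-22, day 22 of the year) is inferred from `Start = c(2020, 22)` and is an assumption, as are the variable names.

```r
# The six values shown in the output above.
confirmed <- c(0, 0, 0, 0, 0, 1)

# Sequence of dates from the first to the last observation (assumed start date).
dates <- seq(as.Date("2020-01-22"), by = "day", length.out = length(confirmed))

# Day of the year of the first observation ("%j" gives 001-366).
start_day <- as.integer(format(dates[1], "%j"))

# Daily series: frequency = 365 observations per year.
cases_ts <- ts(confirmed, start = c(2020, start_day), frequency = 365)

print(cases_ts)
```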

5.2.2 Arima Model for Forecasting

The time series data is fed into the ARIMA model to generate a prediction for the next n days given in the input.
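
A minimal sketch of this step with base R's `arima()` and `predict()` is shown below. The report does not state which package or model order was used, so the order (0, 2, 1) and the simulated case counts are assumptions chosen purely for illustration.

```r
set.seed(42)

# Simulated cumulative case counts (not real data): daily new cases follow
# a slowly drifting random walk and are accumulated over 60 days.
daily_new <- pmax(0, round(100 + cumsum(rnorm(60, mean = 2, sd = 10))))
cases     <- cumsum(daily_new)
cases_ts  <- ts(cases, start = c(2020, 22), frequency = 365)

# Number of days to predict (user input in the report's predictor).
n_days <- 7

# Fit a non-seasonal ARIMA(p, d, q) model; the order is an assumed example.
fit <- arima(cases_ts, order = c(0, 2, 1))

# Point forecasts ($pred) and their standard errors ($se) for the next n days.
prediction <- predict(fit, n.ahead = n_days)

print(prediction$pred)
```

Differencing twice (d = 2) lets the forecast continue the most recent growth rate of the cumulative counts instead of levelling off immediately; in practice the order would be selected from the data.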

5.2.4 Prediction of COVID-19 Cases

Plotting the time series of the actual cases followed by the predicted ones using line graph.

5.2.5 Summary of Predicted Values

Furthermore, we developed a Shiny App.

Using the Shiny App https://mohammed-shaikh.shinyapps.io/Covid/ , we can predict the different types of cases (confirmed, recovered, dead, active) for different countries for 1 to 15 days ahead.

Based on (Gesundheit, n.d.) and (mdr, n.d.), we have identified points in time that mark important steps towards the containment of the virus in Germany as well as in other countries of the world.
For a better overview, we have focused on the following countries: India, Germany, Italy, Brazil, US and Spain.
The following table shows the events and dates presented in the interactive timeline.
Based on these dates we have created plots for confirmed cases, recovered cases and deaths, which include the countries listed above.

data_covid <- data.frame(
  id      = 1:25,
  content = c("China officially reports cases to WHO",
              "In France, the first evidence of the virus",
              "The virus has reached Germany",
              "The WHO declares a 'health emergency of international concern'",
              "France reports the first death in Europe",
              "Coronavirus infections have now been confirmed for the first time in Baden-Württemberg and North Rhine-Westphalia",
              "There are infections in about 60 countries. According to the WHO, there are around 3,000 deaths.",
              "The WHO declares a pandemic",
              "In most of the federal states schools and day care centres are already closed, others will follow. Also border controls and entry bans",
              "Italy is now the country with the most officially reported deaths worldwide",
              "Germany: Federal and state governments agree on strict exit and contact restrictions",
              "At over 140,000, more infections are now known in the USA than have been officially recorded in any other country in the world",
              "The nationwide contact restrictions are extended until 19 April.",
              "Start of the OVGU lecture period - digital",
              "Germany: The contact restrictions are extended until 3 May.",
              "Germany: first relaxation of corona protection measures",
              "Germany: In all federal states the mouth protection obligation applies",
              "Germany: further relaxation of corona restrictions",
              "Germany: end of controls at the German external borders (gradually)",
              "Brazil reports 15,000 new infections within 24 hours",
              "Germany: The number of new infections is below 1,000 (R) for the tenth consecutive day.",
              "The number of infections registered worldwide exceeds five million.",
              "Germany: economic stimulus package adopted",
              "China reports highest increase in new infections since April",
              "Germany: The government has launched the Corona Warning App"),
  start   = c("2019-12-31", "2020-01-24", "2020-01-27", "2020-01-30", "2020-02-15", "2020-02-24", "2020-03-02", 
              "2020-03-11", "2020-03-16", "2020-03-19", "2020-03-22", "2020-03-29", "2020-04-01", "2020-04-06",
              "2020-04-15", "2020-04-20", "2020-04-27", "2020-05-06", "2020-05-13", "2020-05-17", "2020-05-19",
              "2020-05-21", "2020-06-04", "2020-06-14", "2020-06-16"),
  end     = c(NA          ,           NA,           NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA)
)
timeline <- timevis(data_covid)
## only confirmed ##

jc1 <- john_confirmed %>% filter(Country.Region == countries[1]) %>% select(all_of(selector))
jc2 <- john_confirmed %>% filter(Country.Region == countries[2]) %>% select(all_of(selector))
jc3 <- john_confirmed %>% filter(Country.Region == countries[3]) %>% select(all_of(selector))
jc4 <- john_confirmed %>% filter(Country.Region == countries[4]) %>% select(all_of(selector))
jc5 <- john_confirmed %>% filter(Country.Region == countries[5]) %>% select(all_of(selector))
jc6 <- john_confirmed %>% filter(Country.Region == countries[6]) %>% select(all_of(selector))
john_selected_confirmed <- t(rbind(jc1,jc2,jc3,jc4,jc5,jc6))
colnames(john_selected_confirmed) <- countries
confirmed.df <- as.data.frame(t(john_selected_confirmed))
confirmed.df$country <- countries

# Line plots with dots - confirmed
ggplot(confirmed.df, aes(x = country)) + 
  geom_line(aes(y=a,group=2, color = dates[1]), size = 0.5) +
  geom_line(aes(y=b,group=2, color = "2020-01-27"), size = 0.5) +
  geom_line(aes(y=c,group=2, color = "2020-01-30"), size = 0.5) +
  geom_line(aes(y=d,group=2, color = "2020-02-15"), size = 0.5) +
  geom_line(aes(y=e,group=2, color = "2020-02-24"), size = 0.5) +
  geom_line(aes(y=f,group=2, color = "2020-03-02"), size = 0.5) +
  geom_line(aes(y=g,group=2, color = "2020-03-11"), size = 0.5) +
  geom_line(aes(y=h,group=2, color = "2020-03-16"), size = 0.5) +
  geom_line(aes(y=i,group=2, color = "2020-03-19"), size = 0.5) +
  geom_line(aes(y=j,group=2, color = "2020-03-22"), size = 0.5) +
  geom_line(aes(y=k,group=2, color = "2020-03-29"), size = 0.5) +
  geom_line(aes(y=l,group=2, color = "2020-04-01"), size = 0.5) +
  geom_line(aes(y=m,group=2, color = "2020-04-06"), size = 0.5) +
  geom_line(aes(y=n,group=2, color = "2020-04-15"), size = 0.5) +
  geom_line(aes(y=o,group=2, color = "2020-04-20"), size = 0.5) +
  geom_line(aes(y=p,group=2, color = "2020-04-27"), size = 0.5) +
  geom_line(aes(y=q,group=2, color = "2020-05-06"), size = 0.5) +
  geom_line(aes(y=r,group=2, color = "2020-05-13"), size = 0.5) +
  geom_line(aes(y=s,group=2, color = "2020-05-17"), size = 0.5) +
  geom_line(aes(y=t,group=2, color = "2020-05-19"), size = 0.5) +
  geom_line(aes(y=u,group=2, color = "2020-05-21"), size = 0.5) +
  geom_line(aes(y=v,group=2, color = "2020-06-04"), size = 0.5) +
  geom_line(aes(y=w,group=2, color = "2020-06-14"), size = 0.5) +
  geom_line(aes(y=x,group=2, color = "2020-06-16"), size = 0.5) +

  geom_point(aes(y=a))+
  geom_point(aes(y=b))+
  geom_point(aes(y=c))+
  geom_point(aes(y=d))+
  geom_point(aes(y=e))+
  geom_point(aes(y=f))+
  geom_point(aes(y=g))+
  geom_point(aes(y=h))+
  geom_point(aes(y=i))+
  geom_point(aes(y=j))+
  geom_point(aes(y=k))+
  geom_point(aes(y=l))+
  geom_point(aes(y=m))+
  geom_point(aes(y=n))+
  geom_point(aes(y=o))+
  geom_point(aes(y=p))+
  geom_point(aes(y=q))+
  geom_point(aes(y=r))+
  geom_point(aes(y=s))+
  geom_point(aes(y=t))+
  geom_point(aes(y=u))+
  geom_point(aes(y=v))+
  geom_point(aes(y=w))+
  geom_point(aes(y=x))+

  labs(x="country", y="confirmed", color = "Legend") +
  coord_flip() +
  labs(title="Number of confirmed cases - in the space of time from 2020-01-24 to 2020-06-16", 
           subtitle="based on Johns Hopkins data set and the timeline") +  
  theme(
    legend.position = c(.95, .80),
    legend.justification = c("right", "top") )

## only recovered ##
jr1 <- john_recovered %>% filter(Country.Region == countries[1]) %>% select(all_of(selector))
jr2 <- john_recovered %>% filter(Country.Region == countries[2]) %>% select(all_of(selector))
jr3 <- john_recovered %>% filter(Country.Region == countries[3]) %>% select(all_of(selector))
jr4 <- john_recovered %>% filter(Country.Region == countries[4]) %>% select(all_of(selector))
jr5 <- john_recovered %>% filter(Country.Region == countries[5]) %>% select(all_of(selector))
jr6 <- john_recovered %>% filter(Country.Region == countries[6]) %>% select(all_of(selector))
john_selected_recovered <- t(rbind(jr1,jr2,jr3,jr4,jr5,jr6))
colnames(john_selected_recovered) <- countries
recovered.df <- as.data.frame(t(john_selected_recovered))
recovered.df$country <- countries

# Line plots with dots - recovered
ggplot(recovered.df, aes(x = country)) + 
  geom_line(aes(y=a,group=2, color = "2020-01-24"), size = 0.5) +
  geom_line(aes(y=b,group=2, color = "2020-01-27"), size = 0.5) +
  geom_line(aes(y=c,group=2, color = "2020-01-30"), size = 0.5) +
  geom_line(aes(y=d,group=2, color = "2020-02-15"), size = 0.5) +
  geom_line(aes(y=e,group=2, color = "2020-02-24"), size = 0.5) +
  geom_line(aes(y=f,group=2, color = "2020-03-02"), size = 0.5) +
  geom_line(aes(y=g,group=2, color = "2020-03-11"), size = 0.5) +
  geom_line(aes(y=h,group=2, color = "2020-03-16"), size = 0.5) +
  geom_line(aes(y=i,group=2, color = "2020-03-19"), size = 0.5) +
  geom_line(aes(y=j,group=2, color = "2020-03-22"), size = 0.5) +
  geom_line(aes(y=k,group=2, color = "2020-03-29"), size = 0.5) +
  geom_line(aes(y=l,group=2, color = "2020-04-01"), size = 0.5) +
  geom_line(aes(y=m,group=2, color = "2020-04-06"), size = 0.5) +
  geom_line(aes(y=n,group=2, color = "2020-04-15"), size = 0.5) +
  geom_line(aes(y=o,group=2, color = "2020-04-20"), size = 0.5) +
  geom_line(aes(y=p,group=2, color = "2020-04-27"), size = 0.5) +
  geom_line(aes(y=q,group=2, color = "2020-05-06"), size = 0.5) +
  geom_line(aes(y=r,group=2, color = "2020-05-13"), size = 0.5) +
  geom_line(aes(y=s,group=2, color = "2020-05-17"), size = 0.5) +
  geom_line(aes(y=t,group=2, color = "2020-05-19"), size = 0.5) +
  geom_line(aes(y=u,group=2, color = "2020-05-21"), size = 0.5) +
  geom_line(aes(y=v,group=2, color = "2020-06-04"), size = 0.5) +
  geom_line(aes(y=w,group=2, color = "2020-06-14"), size = 0.5) +
  geom_line(aes(y=x,group=2, color = "2020-06-16"), size = 0.5) +
  
  
  geom_point(aes(y=a))+
  geom_point(aes(y=b))+
  geom_point(aes(y=c))+
  geom_point(aes(y=d))+
  geom_point(aes(y=e))+
  geom_point(aes(y=f))+ 
  geom_point(aes(y=g))+ 
  geom_point(aes(y=h))+ 
  geom_point(aes(y=i))+ 
  geom_point(aes(y=j))+ 
  geom_point(aes(y=k))+ 
  geom_point(aes(y=l))+ 
  geom_point(aes(y=m))+ 
  geom_point(aes(y=n))+ 
  geom_point(aes(y=o))+ 
  geom_point(aes(y=p))+ 
  geom_point(aes(y=q))+ 
  geom_point(aes(y=r))+ 
  geom_point(aes(y=s))+ 
  geom_point(aes(y=t))+ 
  geom_point(aes(y=u))+ 
  geom_point(aes(y=v))+ 
  geom_point(aes(y=w))+ 
  geom_point(aes(y=x))+
  
  labs(x="country", y="recovered", color = "Legend") +
  coord_flip() +
  labs(title="Number of recovered cases - in the space of time from 2020-01-24 to 2020-06-16", 
       subtitle="based on Johns Hopkins data set and the timeline") +  
  # theme(legend.position="top")
  theme(
    legend.position = c(.95, .80),
    legend.justification = c("right", "top") )

## only deaths ##

jd1 <- john_deaths %>% filter(Country.Region == countries[1]) %>% select(all_of(selector))
jd2 <- john_deaths %>% filter(Country.Region == countries[2]) %>% select(all_of(selector))
jd3 <- john_deaths %>% filter(Country.Region == countries[3]) %>% select(all_of(selector))
jd4 <- john_deaths %>% filter(Country.Region == countries[4]) %>% select(all_of(selector))
jd5 <- john_deaths %>% filter(Country.Region == countries[5]) %>% select(all_of(selector))
jd6 <- john_deaths %>% filter(Country.Region == countries[6]) %>% select(all_of(selector))
john_selected_deaths <- t(rbind(jd1,jd2,jd3,jd4,jd5,jd6))
colnames(john_selected_deaths) <- countries
deaths.df <- as.data.frame(t(john_selected_deaths))
deaths.df$country <- countries

# Line plots with dots - deaths
ggplot(deaths.df, aes(x = country)) + 
  geom_line(aes(y=a,group=2, color = "2020-01-24"), size = 0.5) +
  geom_line(aes(y=b,group=2, color = "2020-01-27"), size = 0.5) +
  geom_line(aes(y=c,group=2, color = "2020-01-30"), size = 0.5) +
  geom_line(aes(y=d,group=2, color = "2020-02-15"), size = 0.5) +
  geom_line(aes(y=e,group=2, color = "2020-02-24"), size = 0.5) +
  geom_line(aes(y=f,group=2, color = "2020-03-02"), size = 0.5) +
  geom_line(aes(y=g,group=2, color = "2020-03-11"), size = 0.5) +
  geom_line(aes(y=h,group=2, color = "2020-03-16"), size = 0.5) +
  geom_line(aes(y=i,group=2, color = "2020-03-19"), size = 0.5) +
  geom_line(aes(y=j,group=2, color = "2020-03-22"), size = 0.5) +
  geom_line(aes(y=k,group=2, color = "2020-03-29"), size = 0.5) +
  geom_line(aes(y=l,group=2, color = "2020-04-01"), size = 0.5) +
  geom_line(aes(y=m,group=2, color = "2020-04-06"), size = 0.5) +
  geom_line(aes(y=n,group=2, color = "2020-04-15"), size = 0.5) +
  geom_line(aes(y=o,group=2, color = "2020-04-20"), size = 0.5) +
  geom_line(aes(y=p,group=2, color = "2020-04-27"), size = 0.5) +
  geom_line(aes(y=q,group=2, color = "2020-05-06"), size = 0.5) +
  geom_line(aes(y=r,group=2, color = "2020-05-13"), size = 0.5) +
  geom_line(aes(y=s,group=2, color = "2020-05-17"), size = 0.5) +
  geom_line(aes(y=t,group=2, color = "2020-05-19"), size = 0.5) +
  geom_line(aes(y=u,group=2, color = "2020-05-21"), size = 0.5) +
  geom_line(aes(y=v,group=2, color = "2020-06-04"), size = 0.5) +
  geom_line(aes(y=w,group=2, color = "2020-06-14"), size = 0.5) +
  geom_line(aes(y=x,group=2, color = "2020-06-16"), size = 0.5) +
  
  
  geom_point(aes(y=a))+
  geom_point(aes(y=b))+
  geom_point(aes(y=c))+
  geom_point(aes(y=d))+
  geom_point(aes(y=e))+
  geom_point(aes(y=f))+ 
  geom_point(aes(y=g))+ 
  geom_point(aes(y=h))+ 
  geom_point(aes(y=i))+ 
  geom_point(aes(y=j))+ 
  geom_point(aes(y=k))+ 
  geom_point(aes(y=l))+ 
  geom_point(aes(y=m))+ 
  geom_point(aes(y=n))+ 
  geom_point(aes(y=o))+ 
  geom_point(aes(y=p))+ 
  geom_point(aes(y=q))+ 
  geom_point(aes(y=r))+ 
  geom_point(aes(y=s))+ 
  geom_point(aes(y=t))+ 
  geom_point(aes(y=u))+ 
  geom_point(aes(y=v))+ 
  geom_point(aes(y=w))+ 
  geom_point(aes(y=x))+
  
  labs(x="country", y="deaths", color = "Legend") +
  coord_flip() +
  labs(title="Number of deaths - in the space of time from 2020-01-24 to 2020-06-16", 
       subtitle="based on Johns Hopkins data set and the timeline") +  
  theme(
    legend.position = c(.95, .80),
    legend.justification = c("right", "top") )
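
The three plots above repeat one geom_line() and one geom_point() call per date. As a possible alternative, sketched here under assumptions, the table can be reshaped to long format (one row per country and date) so that a single pair of layers draws all dates. The numbers and the names `confirmed_wide` and `confirmed_long` are made up for illustration and do not come from the Johns Hopkins data.

```r
library(ggplot2)

# Hypothetical wide table: one row per country, one column per timeline date.
confirmed_wide <- data.frame(
  country      = c("Germany", "Italy", "Spain"),
  `2020-03-11` = c(10, 40, 20),
  `2020-03-22` = c(60, 90, 70),
  check.names  = FALSE
)
date_cols <- setdiff(names(confirmed_wide), "country")

# Long format: unlist() stacks the date columns one after another, so the
# countries repeat once per date and each date repeats once per country.
confirmed_long <- data.frame(
  country   = rep(confirmed_wide$country, times = length(date_cols)),
  date      = rep(date_cols, each = nrow(confirmed_wide)),
  confirmed = unlist(confirmed_wide[date_cols], use.names = FALSE)
)

# One line and one set of points per date, coloured by date.
p <- ggplot(confirmed_long,
            aes(x = country, y = confirmed, color = date, group = date)) +
  geom_line() +
  geom_point() +
  coord_flip() +
  labs(x = "country", y = "confirmed", color = "Legend")

print(p)
```

With this layout, adding or removing a timeline date only changes the data, not the plotting code.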

5.3 Twitter Analysis

5.3.1 Visualize Words Contributed to Positive and Negative Sentiments

Here, multiple graphs are plotted based on the tweet texts to identify the sentiment orientation and to analyze the most frequent words used to express particular emotions. For plotting the graphs, 10000 tweets were retrieved via the Twitter Search API between 2020-03-14 and 2020-06-30.

Connect to the Twitter Search API using twitteR

## [1] "Using direct authentication"

Here, tweets are categorized as positive or negative, and we then identify the words that contributed most to each sentiment. From this chart, we can see which words people used most frequently to express their positive or negative feelings.

#Grabbing text data from tweets
corona.text <- sapply(corona.tweets, function(x) x$getText())

#Clean text data - drop emoticons and other non-ASCII symbols
#(sub = "" removes unconvertible characters instead of turning the tweet into NA)
corona.text <- iconv(corona.text, 'UTF-8', 'ASCII', sub = "")

#Remove retweet entities, twitter mentions and links before stripping punctuation
corona.text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", corona.text)
corona.text <- gsub("@\\w+", "", corona.text)
corona.text <- gsub("http[^[:space:]]*", "", corona.text)

#Remove punctuation, control characters and digits, then collapse repeated blanks.
#POSIX classes must sit inside a bracket expression: "[[:blank:]]", not "[:blank:]"
#(the latter matches the single letters b, l, a, n, k and the colon).
corona.text <- gsub("[[:punct:]]", "", corona.text)
corona.text <- gsub("[[:cntrl:]]", " ", corona.text)
corona.text <- gsub("[[:digit:]]", "", corona.text)
corona.text <- gsub("[[:blank:]]+", " ", corona.text)


corona.corpus <- Corpus(VectorSource(corona.text))

doc.term.matrix <- DocumentTermMatrix(corona.corpus,control = list(removePunctuation=T,
                                                                   stopwords = c("corona","covid19","pandemic","virus", "covid", "corona virus","covid19pandemic","stay home", "stay safe",'http','https',stopwords('en')),
                                                                   removeNumbers = T,
                                                                   tolower = T))
#Plot the graph
corona.doc.term.matrix <- tidy(doc.term.matrix)
ap_sentiments <- corona.doc.term.matrix %>%
    inner_join(get_sentiments("bing"), by = c(term = "word"))

ap_sentiments %>%
  dplyr::count(sentiment, term, wt = count) %>%
  filter(n >= 1) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(term = reorder(term, n)) %>%
  ggplot2::ggplot(aes(term, n, fill = sentiment)) +
  scale_fill_discrete(name = "Sentiment", labels = c("Negative", "Positive")) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Words") +
  ylab("Contribution to Sentiment")

5.3.2 Bar Plot for Emotion and Frequency

Here, tweets are categorized and analyzed for the eight emotions “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise” and “trust” and the two sentiments “negative” and “positive” using the NRC sentiment dictionary, and a bar graph is plotted for each category. This chart gives a more detailed description of the emotions that people had during the pandemic period.
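
One way to obtain such NRC counts is sketched below. It assumes the syuzhet package, whose `get_nrc_sentiment()` function scores texts against the NRC lexicon; the report does not show which package it used for this step, and the tweets are invented examples.

```r
library(syuzhet)

# Hypothetical cleaned tweet texts.
tweets <- c(
  "i am scared of losing my job",
  "so happy to see friends helping each other",
  "angry about people ignoring the rules"
)

# One row per tweet, one column per NRC category (8 emotions + 2 sentiments).
emotions <- get_nrc_sentiment(tweets)

# Total count per category across all tweets, largest first.
emotion_totals <- sort(colSums(emotions), decreasing = TRUE)

# Bar plot of emotion against frequency.
barplot(emotion_totals, las = 2, ylab = "Frequency",
        main = "Emotions in tweets (toy example)")
```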

5.3.4 Timeline of tweets

corona_df = ldply(corona.tweets, function(t) t$toDataFrame())

#Extract day and hour to check when people tweet the most
corona_df$date <- day(corona_df$created)
corona_df$hour <- hour(corona_df$created)

#Cleaning
#Remove retweet entities (before mentions are stripped, because the pattern relies on "@")
corona_df$text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)"," ",corona_df$text)

#Remove twitter mentions
corona_df$text = gsub("@\\w+", " ", corona_df$text)

#Remove URLs
corona_df$text = gsub("http[^[:space:]]*", "",corona_df$text)

#Remove possessive 's (straight or curly apostrophe) and dots
corona_df$text = gsub("'s|’s|[.]", "", corona_df$text)
#Remove punctuation 
corona_df$text = gsub("[[:punct:]]", " ", corona_df$text)

#Remove single letters (replace with a space so that neighbouring words are not merged)
corona_df$text = gsub(" *\\b[[:alpha:]]{1}\\b *", " ", corona_df$text)

#Remove unnecessary spaces
corona_df$text = gsub("[ \t]{2,}", " ", corona_df$text)

#Remove leading and trailing whitespaces 
corona_df$text = gsub("^\\s+|\\s+$", "", corona_df$text)

#Convert text object to corpus object to be recognized by tm
text_corpus <- Corpus(VectorSource(corona_df$text))
#Convert text to lower case (content_transformer keeps the corpus structure intact)
text_corpus <- tm_map(text_corpus, content_transformer(tolower))
#Remove words that we searched
text_corpus <- tm_map(text_corpus, removeWords, c("corona","covid19","pandemic","virus", "covid", "corona virus","covid19pandemic","stay home", "stay safe",'http','https'))
#Remove english stop words
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
#Remove punctuation
text_corpus <- tm_map(text_corpus, removePunctuation)
#Remove whitespaces
text_corpus = tm_map(text_corpus, stripWhitespace)
#Stem words in the corpus 
text_corpus<-tm_map(text_corpus, stemDocument)

#Cleaned corpus to data frame
text_df <- data.frame(text_clean = get("content", text_corpus), stringsAsFactors = FALSE)
#Bind data frame with original
corona_df <- cbind.data.frame(corona_df, text_df)
#Sentiment analysis
corona_sentiment <- analyzeSentiment(corona_df$text_clean)
corona_sentiment <- dplyr::select(corona_sentiment, 
                                  SentimentGI, SentimentHE,
                                  SentimentLM, SentimentQDAP, 
                                  WordCount)
corona_sentiment <- dplyr::mutate(corona_sentiment, 
                                  mean_sentiment = rowMeans(corona_sentiment[,-5]))
corona_sentiment <- dplyr::select(corona_sentiment, 
                                  WordCount, 
                                  mean_sentiment)

corona_df <- cbind.data.frame(corona_df, corona_sentiment)
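#Label tweets by the sign of the mean score; note that a score of exactly 0 (neutral) is counted as positive here
full_sentiment <- dplyr::mutate(corona_df, sentiment_score = if_else(mean_sentiment < 0, -1, 1) )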
full_sentiment <- dplyr::mutate(full_sentiment, sentiment = if_else(mean_sentiment < 0, "Negative", "Positive") )
full_sentiment <- dplyr::filter(full_sentiment, sentiment == "Negative" | sentiment == "Positive")

ggplot(full_sentiment, aes(x = hour , fill=sentiment)) + 
  ggtitle("Hourly Positive and Negative Tweet Counts") + 
  geom_line(aes(color = sentiment), size = 2, stat = 'count') +
  scale_fill_discrete(name = "Sentiment", labels = c("Negative", "Positive")) +
  xlab("Hour") + 
  ylab("Tweet Count") + 
  ylim(c(300, 1200)) + 
  scale_color_manual(values = c("#00AFBB", "#E7B800")) 

Here the timeline shows how the numbers of positive and negative tweets changed over time. The chart is created for tweets posted on 29th June 2020 between 21:00 and 23:00, and it shows how the sentiment counts varied during that period.

5.3.5 Analysis of German Tweets Word Cloud

Here a word cloud is created from retrieved German tweets. This gives us an idea of which words were used frequently in Germany during the crisis.

Here, the same German word cloud is plotted with a different theme, which is more user-friendly and makes the frequent words easier to identify.

6 Final analysis

6.0.1 Patient Medical Data

Our overall goal was to exploit the sample patient dataset. Each observation of the dataset contains the details of a confirmed COVID-19 patient, such as age, DatOfOnsetSymptoms, DateOfDischarge etc. Among 15466 observations, 6006 (38.83%) are female and 9460 (61.17%) are male. 766 (4.95%) are 18 and below, 12059 (77.97%) are between 20 and 60 inclusive, and 2641 (17.08%) are above 60. The low count of children suggests that there is a relatively low attack rate in this age group. The median age is 45 years (range 1-100 years; IQR 38-53 years), with the majority of cases aged between 16 and 75 years.
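The shares quoted above follow directly from the counts. A minimal sketch in base R, using only the counts given in the text (the vector names are chosen here for illustration):

```r
# Counts reported in the text
sex_counts <- c(Female = 6006, Male = 9460)
age_counts <- c("18 and below" = 766, "20 to 60" = 12059, "above 60" = 2641)

# Share of each group in percent, rounded to two decimals
sex_pct <- round(100 * prop.table(sex_counts), 2)
age_pct <- round(100 * prop.table(age_counts), 2)

print(sex_pct)
print(age_pct)
```

Both count vectors sum to the 15466 observations of the dataset, so `prop.table` uses the same denominator as the text.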

Individuals at higher risk of severe disease and death include those with underlying medical conditions such as hypertension, diabetes, cardiovascular disease, chronic respiratory disease and cancer. Our sample contains 720 (4.6%) patients with chronic diseases, of whom 302 (41.94%) are female and 418 (58.06%) are male. Additionally, extracting each chronic disease from 149 observations (excluding missing values) and plotting its frequency showed that hypertension (34.7%) and diabetes (24.3%) are the most frequent chronic conditions among these patients, which suggests that people with these conditions are more vulnerable to COVID-19.

Since people with chronic disease are likely to face an increased risk of developing severe symptoms and eventually dying, we estimate their chances of recovery and compare them with those of patients without chronic disease. The recovery:death ratio is 17:49 for patients with chronic disease and 53:8 for the others. This suggests that people with a chronic disease have a much lower chance of recovery when infected with COVID-19.
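To make the two ratios easier to compare, they can be turned into recovery shares and an odds ratio. The sketch below treats the two parts of each ratio as counts of one whole, which is an assumption; the object names are illustrative.

```r
# Recovery:death ratios from the text, treated as parts of a whole (assumption)
outcomes <- rbind(
  chronic    = c(recovered = 17, died = 49),
  no_chronic = c(recovered = 53, died = 8)
)

# Share of recoveries within each group, in percent
recovery_share <- round(100 * outcomes[, "recovered"] / rowSums(outcomes), 1)
print(recovery_share)

# Odds of recovery with chronic disease relative to the odds without
odds_ratio <- (outcomes["chronic", "recovered"] / outcomes["chronic", "died"]) /
  (outcomes["no_chronic", "recovered"] / outcomes["no_chronic", "died"])
print(round(odds_ratio, 3))
```

An odds ratio well below 1 points in the same direction as the text: recovery is far less likely in the chronic disease group.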

Based on 1644 confirmed cases (excluding observations with missing values for Symptoms) collected until March 2020, typical signs and symptoms include fever (32.65%), dry cough (18.38%), sore throat (3.76%), pneumonia (3.5%), fatigue (2.5%), malaise (2.5%), rhinorrhea (2.3%), headache (2.23%), myalgias (2.22%), shortness of breath (1.9%), sputum (1.5%) etc. Focusing on Wuhan City, we plot a pie chart of the initially observed symptoms.

Fever was the most frequent symptom, seen on its own (44.4%) as well as together with other symptoms such as cough (28.1%), weakness (2.96%), sore throat (2.96%) and fatigue (2.96%).

The outbreak soon spread from China to other parts of the world. We use the geographical locations of the patients provided in our dataset to find the places that were or were not affected by COVID-19. The map reveals that the virus spread from the Chinese city of Wuhan to more than 180 countries and territories, affecting every continent except Antarctica.

In addition to chronic disease, age also influences the level of risk of severe disease and death. Whether people aged over 60 are at higher risk than those under 60 can be checked with a statistical hypothesis test such as the t-test. Creating a separate data frame of those who recovered (372 observations), we draw a boxplot that shows a median age of 45 (IQR 30-53) and a mean age of 43.29. We then create another data frame of those who died and draw a second boxplot. The median age in this plot is 67 (IQR 55-79).
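The age comparison can be sketched as follows. The ages are simulated and purely hypothetical (centred loosely on the reported values, with an assumed group size of 100 for the deceased), so the output illustrates the test, not our results.

```r
set.seed(1)

# Simulated ages (hypothetical, NOT the patient data), clipped to 1-100 years
age_recovered <- pmin(pmax(round(rnorm(372, mean = 43, sd = 15)), 1), 100)
age_died      <- pmin(pmax(round(rnorm(100, mean = 67, sd = 15)), 1), 100)

# Welch two-sample t-test; alternative = "greater" tests whether
# the deceased group is older on average than the recovered group
result <- t.test(age_died, age_recovered, alternative = "greater")
print(result$p.value)

# Side-by-side boxplots of the two groups
boxplot(list(Recovered = age_recovered, Died = age_died), ylab = "Age")
```

A small p-value would support the claim that the deceased patients are older on average; with the real data the two vectors would be the age columns of the two data frames.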

According to WHO, the recovery time tends to be about two weeks for those with mild symptoms and about 3-6 weeks for those with severe or critical disease. However, these seem to be only rough guidelines, as studies have already shown a number of exceptions. We plot the recovery timeline for some patients and see variations in the number of days taken for recovery. From this, we conclude that a window of 2-4 weeks can be considered as the recovery time. Similarly, we plot the timeline until death for some patients. In this case, most of the patients died within 3 weeks, whereas the majority of the patients older than 70 died in less than 2 weeks. These are the analyses that we have made on the COVID-19 patient medical data.
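The recovery time per patient is the difference between two dates. A minimal sketch with three hypothetical records; the column names `onset` and `discharge` are placeholders, not the dataset's actual field names.

```r
# Hypothetical records (NOT taken from the patient dataset)
patients <- data.frame(
  onset     = as.Date(c("2020-01-20", "2020-02-01", "2020-02-10")),
  discharge = as.Date(c("2020-02-05", "2020-02-22", "2020-03-09"))
)

# Days between symptom onset and discharge
patients$recovery_days <- as.numeric(patients$discharge - patients$onset)

# Flag patients inside the 2-4 week window (14 to 28 days)
patients$in_window <- patients$recovery_days >= 14 & patients$recovery_days <= 28

print(patients)
```

The same subtraction with the date of death instead of the discharge date gives the timeline until death.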

6.0.2 Time Series Prediction

No prediction is certain: a forecast holds only to the extent that the past repeats itself. Several factors come into play when making a prediction, such as psychological factors (how people perceive and react to a dangerous situation), the availability of data and the variables used. Assuming that the data used are reliable and that the disease will continue to follow its past trends, our forecasts indicate an increase in confirmed COVID-19 cases, deaths and recoveries, with slight fluctuations.
We can see that the restrictions in Germany were important steps towards the containment of the virus. This has led to fewer deaths and confirmed cases than, for example, in the US.
It is interesting to note that, in terms of confirmed cases, the strict Spanish restrictions made only a small difference compared with the less stringent restrictions in Germany.
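How an ARIMA forecast of this kind is produced can be sketched with base R's `arima` and `predict`. The series below is simulated and hypothetical (not the Johns Hopkins data), and the order (1, 2, 1) is an illustrative assumption rather than the model selected in our analysis.

```r
set.seed(42)

# Simulated cumulative confirmed cases for 60 days (hypothetical data)
daily_new <- rpois(60, lambda = seq(20, 80, length.out = 60))
confirmed <- ts(cumsum(daily_new))

# Fit ARIMA(1, 2, 1); the order is an assumption for illustration only
fit <- arima(confirmed, order = c(1, 2, 1), method = "ML")

# Forecast the next 7 days with an approximate 95% interval
fc    <- predict(fit, n.ahead = 7)
lower <- fc$pred - 1.96 * fc$se
upper <- fc$pred + 1.96 * fc$se
print(round(cbind(forecast = fc$pred, lower, upper)))
```

The interval widens with the forecast horizon, which reflects the growing uncertainty discussed above.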

7 Twitter Analysis

The objective of this Twitter sentiment analysis was to identify the emotions and the sentiment direction of the public during the coronavirus outbreak. The plots indicate that people expressed more positive than negative sentiment towards the situation. Furthermore, anticipation and trust are the most frequently expressed emotions during the pandemic. Looking at the words most frequently used to express sentiment, we find that “ugh”, “die”, “miss”, “sue” and “worse” are the most frequent negative words, while “good”, “trump”, “love”, “hug” and “wow” are the most frequent positive words. In the word cloud for German tweets, words such as “schon”, “deutschland”, “pandamie”, “lockdown” and “youtube” appear frequently.

References

Dey, Samrat K, Md Mahbubur Rahman, Umme R Siddiqi, and Arpita Howlader. 2020. “Analyzing the Epidemiological Outbreak of Covid-19: A Visual Exploratory Data Analysis Approach.” Journal of Medical Virology 92 (6): 632–38.

Dubey, Akash Dutt. 2020. “Twitter Sentiment Analysis During Covid19 Outbreak.” Available at SSRN 3572023.

Bundesministerium für Gesundheit. n.d. “Coronavirus SARS-CoV-2: Chronik der bisherigen Maßnahmen.” https://www.bundesgesundheitsministerium.de/coronavirus/chronik-coronavirus.html.

Gupta, Rajan, and Saibal Kumar Pal. 2020. “Trend Analysis and Forecasting of Covid-19 Outbreak in India.” medRxiv.

Polglase, Katie, Gianluca Mezzofiore, and Max Foster. n.d. “Here’s Why the Coronavirus May Be Killing More Men Than Women. The US Should Take Note.” https://edition.cnn.com/2020/03/24/health/coronavirus-gender-mortality-intl/index.html.

Manguri, Kamaran H, Rebaz N Ramadhan, and Pshko R Mohammed Amin. 2020. “Twitter Sentiment Analysis on Worldwide Covid-19 Outbreaks.” Kurdistan Journal of Applied Research, 54–65.

Salathé, Marcel, and Nicky Case. n.d. “What Happens Next?” https://ncase.me/covid-19/.

Haas, Simon, and Robert Meyer. n.d. “Alle Zahlen Und Grafiken Zum Coronavirus.” https://www.zdf.de/nachrichten/heute/coronavirus-ausbreitung-infografiken-102.html.

Singh, Sarbjit, Kulwinder Singh Parmar, Jatinder Kumar, and Sidhu Jitendra Singh Makkhan. 2020. “Development of New Hybrid Model of Discrete Wavelet Decomposition and Autoregressive Integrated Moving Average (ARIMA) Models in Application to One Month Forecast the Casualties Cases of Covid-19.” Chaos, Solitons & Fractals, 109866.

Solonko, Mykyta. n.d. “R Tutorial: Analyzing Covid-19 Data.” https://towardsdatascience.com/r-tutorial-analyzing-covid-19-data-12670cd664d6.